tsibble objectsHow to create a tsibble:
library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✓ tibble 3.1.5 ✓ tsibble 1.0.1
## ✓ dplyr 1.0.7 ✓ tsibbledata 0.3.0
## ✓ tidyr 1.1.4 ✓ feasts 0.2.2
## ✓ lubridate 1.8.0 ✓ fable 0.3.1
## ✓ ggplot2 3.3.5
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag() masks stats::lag()
## x tsibble::setdiff() masks base::setdiff()
## x tsibble::union() masks base::union()
y <- tsibble(
Year = 2015:2019,
Observation = c(123, 39, 78, 52, 110),
index = Year
)
y
## # A tsibble: 5 x 2 [1Y]
## Year Observation
## <int> <dbl>
## 1 2015 123
## 2 2016 39
## 3 2017 78
## 4 2018 52
## 5 2019 110
For observations that are more frequent than once per year, we need to use a time class function on the index. For example:
olympic_running
## # A tsibble: 312 x 4 [4Y]
## # Key: Length, Sex [14]
## Year Length Sex Time
## <int> <int> <chr> <dbl>
## 1 1896 100 men 12
## 2 1900 100 men 11
## 3 1904 100 men 11
## 4 1908 100 men 10.8
## 5 1912 100 men 10.8
## 6 1916 100 men NA
## 7 1920 100 men 10.8
## 8 1924 100 men 10.6
## 9 1928 100 men 10.8
## 10 1932 100 men 10.3
## # … with 302 more rows
The 14 time series in this object are uniquely identified by the keys: the Length and Sex variables. The distinct() function can be used to show the categories of each variable or even combinations of variables:
olympic_running %>% distinct(Sex)
## # A tibble: 2 × 1
## Sex
## <chr>
## 1 men
## 2 women
We can use dplyr functions such as mutate(), filter(), select() and summarise() to work with tsibble objects.
PBS
## # A tsibble: 67,596 x 9 [1M]
## # Key: Concession, Type, ATC1, ATC2 [336]
## Month Concession Type ATC1 ATC1_desc ATC2 ATC2_desc Scripts Cost
## <mth> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1991 Jul Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 18228 67877
## 2 1991 Aug Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 15327 57011
## 3 1991 Sep Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 14775 55020
## 4 1991 Oct Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 15380 57222
## 5 1991 Nov Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 14371 52120
## 6 1991 Dec Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 15028 54299
## 7 1992 Jan Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 11040 39753
## 8 1992 Feb Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 15165 54405
## 9 1992 Mar Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 16898 61108
## 10 1992 Apr Concession… Co-pa… A Alimentary… A01 STOMATOLOG… 18141 65356
## # … with 67,586 more rows
Filter the dataset to extract the A10 scripts:
a10 <- PBS %>% filter(ATC2 == "A10")
a10
## # A tsibble: 816 x 9 [1M]
## # Key: Concession, Type, ATC1, ATC2 [4]
## Month Concession Type ATC1 ATC1_desc ATC2 ATC2_desc Scripts Cost
## <mth> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1991 Jul Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 89733 2.09e6
## 2 1991 Aug Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 77101 1.80e6
## 3 1991 Sep Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 76255 1.78e6
## 4 1991 Oct Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 78681 1.85e6
## 5 1991 Nov Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 70554 1.69e6
## 6 1991 Dec Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 75814 1.84e6
## 7 1992 Jan Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 64186 1.56e6
## 8 1992 Feb Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 75899 1.73e6
## 9 1992 Mar Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 89445 2.05e6
## 10 1992 Apr Concession… Co-pa… A Alimentary… A10 ANTIDIABE… 97315 2.23e6
## # … with 806 more rows
Select the columns we need in our analysis:
a10 <- a10 %>% select(Month, Concession, Type, Cost)
a10
## # A tsibble: 816 x 4 [1M]
## # Key: Concession, Type [4]
## Month Concession Type Cost
## <mth> <chr> <chr> <dbl>
## 1 1991 Jul Concessional Co-payments 2092878
## 2 1991 Aug Concessional Co-payments 1795733
## 3 1991 Sep Concessional Co-payments 1777231
## 4 1991 Oct Concessional Co-payments 1848507
## 5 1991 Nov Concessional Co-payments 1686458
## 6 1991 Dec Concessional Co-payments 1843079
## 7 1992 Jan Concessional Co-payments 1564702
## 8 1992 Feb Concessional Co-payments 1732508
## 9 1992 Mar Concessional Co-payments 2046102
## 10 1992 Apr Concessional Co-payments 2225977
## # … with 806 more rows
Summarise allows us to combine data across keys:
a10 <- a10 %>% summarise(TotalCost = sum(Cost))
a10
## # A tsibble: 204 x 2 [1M]
## Month TotalCost
## <mth> <dbl>
## 1 1991 Jul 3526591
## 2 1991 Aug 3180891
## 3 1991 Sep 3252221
## 4 1991 Oct 3611003
## 5 1991 Nov 3565869
## 6 1991 Dec 4306371
## 7 1992 Jan 5088335
## 8 1992 Feb 2814520
## 9 1992 Mar 2985811
## 10 1992 Apr 3204780
## # … with 194 more rows
Create new variables using the mutate function:
a10 <- a10 %>% mutate(TotalCost = TotalCost / 1e6)
a10
## # A tsibble: 204 x 2 [1M]
## Month TotalCost
## <mth> <dbl>
## 1 1991 Jul 3.53
## 2 1991 Aug 3.18
## 3 1991 Sep 3.25
## 4 1991 Oct 3.61
## 5 1991 Nov 3.57
## 6 1991 Dec 4.31
## 7 1992 Jan 5.09
## 8 1992 Feb 2.81
## 9 1992 Mar 2.99
## 10 1992 Apr 3.20
## # … with 194 more rows
prison <- readr::read_csv('https://OTexts.com/fpp3/extrafiles/prison_population.csv')
## Rows: 3072 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): State, Gender, Legal, Indigenous
## dbl (1): Count
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
prison <- prison %>%
mutate(Quarter = yearquarter(Date)) %>%
select(-Date) %>%
as_tsibble(key = c(State, Gender, Legal, Indigenous),
index = Quarter)
prison
## # A tsibble: 3,072 x 6 [1Q]
## # Key: State, Gender, Legal, Indigenous [64]
## State Gender Legal Indigenous Count Quarter
## <chr> <chr> <chr> <chr> <dbl> <qtr>
## 1 ACT Female Remanded ATSI 0 2005 Q1
## 2 ACT Female Remanded ATSI 1 2005 Q2
## 3 ACT Female Remanded ATSI 0 2005 Q3
## 4 ACT Female Remanded ATSI 0 2005 Q4
## 5 ACT Female Remanded ATSI 1 2006 Q1
## 6 ACT Female Remanded ATSI 1 2006 Q2
## 7 ACT Female Remanded ATSI 1 2006 Q3
## 8 ACT Female Remanded ATSI 0 2006 Q4
## 9 ACT Female Remanded ATSI 0 2007 Q1
## 10 ACT Female Remanded ATSI 1 2007 Q2
## # … with 3,062 more rows
For time series data, the obvious graph to start with is a time plot. That is, the observations are plotted against the time of observation, with consecutive observations joined by straight lines. Figure 2.1 shows the weekly economy passenger load on Ansett airlines between Australia’s two largest cities.
melsyd_economy <- ansett %>%
filter(Airports == "MEL-SYD", Class == "Economy") %>%
mutate(Passengers = Passengers / 1000)
autoplot(melsyd_economy, Passengers) +
labs(title = "Ansett airline economy class",
subtitle = "Melbourne-Sydney",
y = "Passengers ('000)")
The time plot immediately reveals some interesting features.
• There was a period in 1989 when no passengers were carried — this was due to an industrial dispute.
• There was a period of reduced load in 1992. This was due to a trial in which some economy class seats were replaced by business class seats.
• A large increase in passenger load occurred in the second half of 1991.
• There are some large dips in load around the start of each year. These are due to holiday effects. • There is a long-term fluctuation in the level of the series which increases during 1987, decreases in 1989, and increases again through 1990 and 1991.
• There are some periods of missing observations.
Let’s take a look at the a10 data saved earlier, plotted as a time series:
autoplot(a10, TotalCost) +
labs(y = "$ millionss",
title = "Australian antidiabetic drug sales")
Trend
A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes we will refer to a trend as “changing direction,” when it might go from an increasing trend to a decreasing trend. There is a trend in the antidiabetic drug sales data shown in Figure 2.2.
Seasonal
A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known period. The monthly sales of antidiabetic drugs (Figure 2.2) shows seasonality which is induced partly by the change in the cost of the drugs at the end of the calendar year.
Cyclic
A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. These fluctuations are usually due to economic conditions, and are often related to the “business cycle.” The duration of these fluctuations is usually at least 2 years.
A seasonal plot is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed. An example is given in Figure 2.4 showing the antidiabetic drug sales.
a10 %>%
gg_season(TotalCost, labels = "both") +
labs(y = "$ millions",
title = "Seasonal plot: Antidiabets drug sales") +
expand_limits(x = ymd(c("1972-12-28", "1973-12-04")))
Where the data has more than one seasonal pattern, the period argument can be used to select which seasonal plot is required. The vic_elec data contains half-hourly electricity demand for the state of Victoria, Australia. We can plot the daily pattern, weekly pattern or yearly pattern by specifying the period argument.
vic_elec %>% gg_season(Demand, period = "day") +
theme(legend.position = "none") +
labs(y = "MW", title = "Electricity demand: Victoria")
vic_elec %>% gg_season(Demand, period = "week") +
theme(legend.position = "none") +
labs(y = "MW", title = "Electricity demand: Victoria by day of the week")
vic_elec %>% gg_season(Demand, period = "year") +
labs(y="Megawatts", title = "Electricity demand: Victoria")
An alternative plot that emphasises the seasonal patterns is where the data for each season are collected together in separate mini time plots.
a10 %>%
gg_subseries(TotalCost) +
labs(
y = "$ millions",
title = "Australian antidiabetic drug sales"
)
holidays <- tourism %>%
filter(Purpose == "Holiday") %>%
group_by(State) %>%
summarise(Trips = sum(Trips))
holidays
## # A tsibble: 640 x 3 [1Q]
## # Key: State [8]
## State Quarter Trips
## <chr> <qtr> <dbl>
## 1 ACT 1998 Q1 196.
## 2 ACT 1998 Q2 127.
## 3 ACT 1998 Q3 111.
## 4 ACT 1998 Q4 170.
## 5 ACT 1999 Q1 108.
## 6 ACT 1999 Q2 125.
## 7 ACT 1999 Q3 178.
## 8 ACT 1999 Q4 218.
## 9 ACT 2000 Q1 158.
## 10 ACT 2000 Q2 155.
## # … with 630 more rows
autoplot(holidays, Trips) +
labs(y = "Overnight trips '000",
title = "Australian domestic holidays") +
facet_grid(~State)
To see the timing of the seasonal peaks in each state, we can use a season plot.
gg_season(holidays, Trips) +
labs(y = "Overnight trips '000",
title = "Australian domestic holidays")
holidays %>%
gg_subseries(Trips) +
labs(y = "Overnight trips '000",
titla = "Australian domestic holidays")
The graphs discussed so far are useful for visualising individual time series. It is also useful to explore relationships between time series.
vic_elec %>%
filter(year(Time) == 2014) %>%
autoplot(Demand)+
labs(y = "Gigawatts",
title = "Half-hourly electricity demand: Victoria")
vic_elec %>%
filter(year(Time) == 2014) %>%
autoplot(Temperature) +
labs(
y = "Degrees Celsius",
title = "Half-hourly temperatures: Melbourne, Australia"
)
vic_elec %>%
filter(year(Time) == 2014) %>%
ggplot(aes(x = Temperature, y = Demand)) +
geom_point() +
labs(x = "Temperature (degrees Celsius)",
title = "Electricity demand (gigawatts)")
\[r = \frac{\sum_{i=0}^n\left(x_t - \bar{x} \right)\left(y_t - \bar{y} \right)^2}{\sqrt{\left(x_t - \bar{x} \right)^2}\sqrt{\left(y_t - \bar{y} \right)^2}}\]
When there are several potential predictor variables, it is useful to plot each variable against each other variable.
visitors <- tourism %>%
group_by(State) %>%
summarise(Trips = sum(Trips))
visitors %>%
ggplot(aes(x = Quarter, y = Trips)) +
geom_line() +
facet_grid(vars(State), scales = "free_y") +
labs(title = "Australian domestic tourism",
y = "Overnight trips '000")
To see the relationships between these eight time series, we can plot each time series against the others. These plots can be arranged in a scatterplot matrix, as shown in Figure 2.18. (This plot requires the GGally package to be installed.)
visitors %>%
pivot_wider(values_from = Trips, names_from = State) %>%
GGally::ggpairs(columns = 2:9)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
For each panel, the variable on the vertical axis is given by the variable name in that row, and the variable on the horizontal axis is given by the variable name in that column. There are many options available to produce different plots within each panel. In the default version, the correlations are shown in the upper right half of the plot, while the scatterplots are shown in the lower half. On the diagonal are shown density plots.
The value of the scatterplot matrix is that it enables a quick view of the relationships between all pairs of variables. In this example, mostly positive relationships are revealed, with the strongest relationships being between the neighbouring states located in the south and south east coast of Australia, namely, New South Wales, Victoria and South Australia. Some negative relationships are also revealed between the Northern Territory and other regions. The Northern Territory is located in the north of Australia famous for its outback desert landscapes visited mostly in winter. Hence, the peak visitation in the Northern Territory is in the July (winter) quarter in contrast to January (summer) quarter for the rest of the regions.
Figure 2.19 displays scatterplots of quarterly Australian beer production (introduced in Figure 1.1), where the horizontal axis shows lagged values of the time series. Each graph shows \[y_t \textrm{ is plotted against } y_{t-k} \textrm{ for different values of k}\]
recent_production <- aus_production %>%
filter(year(Quarter) >= 2000)
recent_production %>%
gg_lag(Beer, geom = "point") +
labs(x = "lab(Beerk, k")
Here the colours indicate the quarter of the variable on the vertical axis. The relationship is strongly positive at lags 4 and 8, reflecting the strong seasonality in the data. The negative relationship seen for lags 2 and 6 occurs because peaks (in Q4) are plotted against troughs (in Q2)
Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.
\[r_k = \frac{\sum_{t = k+1}^T\left(y_t - \bar{y} \right)\left(y_{t-k} - \bar{y} \right)}{\sum_{t = 1}^T{\left(y_t - \bar{y} \right)^2}}\]
where T is the length of the time series. The autocorrelation coefficients make up the autocorrelation function or ACF.
The autocorrelation coefficients for the beer production data can be computed using the ACF() function.
recent_production %>% ACF(Beer, lab_max = 9)
## Warning: The `...` argument of `PACF()` is deprecated as of feasts 0.2.2.
## ACF variables should be passed to the `y` argument. If multiple variables are to be used, specify them using `vars(...)`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: ACF currently only supports one column, `Beer` will be used.
## # A tsibble: 16 x 2 [1Q]
## lag acf
## <lag> <dbl>
## 1 1Q -0.0530
## 2 2Q -0.758
## 3 3Q -0.0262
## 4 4Q 0.802
## 5 5Q -0.0775
## 6 6Q -0.657
## 7 7Q 0.00119
## 8 8Q 0.707
## 9 9Q -0.0888
## 10 10Q -0.588
## 11 11Q 0.0241
## 12 12Q 0.632
## 13 13Q -0.0660
## 14 14Q -0.505
## 15 15Q -0.00891
## 16 16Q 0.498
The values in the acf column are \[r_1...r_9\], corresponding to the nine scatterplots in Figure 2.19. We usually plot the ACF to see how the correlations change with the lag
k. The plot is sometimes known as a correlogram.
recent_production %>%
ACF(Beer) %>%
autoplot() + labs(title = "Australian beer production")
In this graph: • \[r_4\] is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be four quarters apart and the troughs tend to be four quarters apart. • \[r_2\] is more negative than for the other lags because troughs tend to be two quarters behind peaks. The dashed blue lines indicate whether the correlations are significantly different from zero (as explained in Section 2.9).
When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.
When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags.
a10 %>%
ACF(TotalCost, lag_max = 48) %>%
autoplot() +
labs(title = "Australian antidiabetic drug sales")
Time series that show no autocorrelation are called white noise. Figure 2.22 gives an example of a white noise series.
set.seed(30)
y <- tsibble(sample = 1:50, wn = rnorm(50), index = sample)
y %>% autoplot(wn) + labs(title = "White noise", y = "")
y %>%
ACF(wn) %>%
autoplot() + labs(title = "White noise")
Completion of text for chapter 2.